perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1160
JC-ut0 wants to merge 2 commits into jd-opensource:main
Conversation
Code Review
This pull request introduces support for hybrid attention models (such as qwen3_next) by differentiating between full attention and linear (GDN) attention layers during KV cache estimation and allocation. Key changes include updating LLMEngine and RecEngine to calculate cache capacity based on specific layer types, adding logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactoring WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer. Review feedback highlights the need for consistency across the engine: the centralized is_full_attention_layer helper should be used so that default attention intervals are handled in one place and potential division-by-zero errors are avoided.
```cpp
            torch::dtype(dtype_).device(device_)),
        2);
  }
#elif defined(USE_ILU) || defined(USE_MLU) || defined(USE_MUSA)
```
The #elif defined(USE_ILU) || defined(USE_MLU) || defined(USE_MUSA) and #else branches appear to have identical behavior — what's the reason for splitting them?
This follows the original code style below.
Removed unused layer types variable from worker_impl.cpp
Add logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactor WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer.